{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "<center>\n",
    "<img src=\"https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/img/ods_stickers.jpg\" />\n",
    "\n",
    "<h1>Open Machine Learning Course</h1>\n",
    "\n",
    "\n",
    "<br><br>\n",
    "\n",
    "<center>\n",
    "Auteur: [Yury Kashnitskiy](https://yorko.github.io).  \n",
    "Traduit et édité par [Ousmane Cissé](https://www.linkedin.com/in/ousmane-ciss%C3%A9/).  \n",
    "Ce matériel est soumis aux termes et conditions de la licence [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).  \n",
    "L'utilisation gratuite est autorisée à des fins non commerciales.\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "# <center>Topic 7. Apprentissage non supervisé\n",
    "## <center>Part 1. Analyse en Composantes Principales\n",
    "    \n",
    "**Il s'agit principalement de démontrer certaines applications de l'ACP, pour la théorie, l'étude [sujet 7](https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic07_unsupervised/topic7_pca_clustering.ipynb?flush_cache=true) dans notre cours. <br> Notebook utilisé dans l'enregistrement de la leçon 7, [partie 1](https://youtu.be/-AswHf7h0I4).**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Décomposition en valeurs singulières (SVD) de la matrice $X$:\n",
    "\n",
    "$$\\Large X = UDV^T,$$\n",
    "\n",
    "où $U \\in R^{m \\times m}$, $V \\in R^{n \\times n}$ - sont des matrices orthogonales et $D \\in R^{m \\times n}$ - est une matrice diagonale\n",
    "\n",
    "<img src='https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/img/svd_diag_matrix.png' width=70%>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "### Analyse en composantes principales\n",
    "1. Définir $k<d$ - une nouvelle dimension\n",
    "2. Échelle $X$:\n",
    " - changer $\\Large x^{(i)}$ en $\\Large  x^{(i)} - \\frac{1}{m} \\sum_{i=1}^{m}{x^{(i)}}$\n",
    " - $\\Large  \\sigma_j^2 = \\frac{1}{m} \\sum_{i=1}^{m}{(x^{(i)})^2}$\n",
    " - changer $\\Large x_j^{(i)}$ en $\\Large \\frac{x_j^{(i)}}{\\sigma_j}$\n",
    "4. Trouvez la décomposition en valeurs singulières de $X$:\n",
    "$$\\Large X = UDV^T$$\n",
    "5. Définissez $V =$ [$k$ colonnes les plus à gauche de $V$]\n",
    "6. Retourne $$\\Large Z = XV \\in \\mathbb{R}^{m \\times k}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "### Exemple 2D"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.decomposition import PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(0)\n",
    "mean = np.array([0.0, 0.0])\n",
    "cov = np.array([[1.0, -1.0], \n",
    "                [-2.0, 3.0]])\n",
    "X = np.random.multivariate_normal(mean, cov, 300)\n",
    "\n",
    "pca = PCA()\n",
    "pca.fit(X)\n",
    "print('Proportion of variance explained by each component:\\n' +\\\n",
    "      '1st component - %.2f,\\n2nd component - %.2f\\n' %\n",
    "      tuple(pca.explained_variance_ratio_))\n",
    "print('Directions of principal components:\\n' +\\\n",
    "      '1st component:', pca.components_[0],\n",
    "      '\\n2nd component:', pca.components_[1])\n",
    "\n",
    "plt.figure(figsize=(10,10))\n",
    "plt.scatter(X[:, 0], X[:, 1], s=50, c='r')\n",
    "for l, v in zip(pca.explained_variance_ratio_, pca.components_):\n",
    "    d = 5 * np.sqrt(l) * v\n",
    "    plt.plot([0, d[0]], [0, d[1]], '-k', lw=3)\n",
    "plt.axis('equal')\n",
    "plt.title('2D normal distribution sample and its principal components')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "1ère composante principale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep enough components to explain 90% of variance\n",
    "pca = PCA(0.90)\n",
    "X_reduced = pca.fit_transform(X)\n",
    "\n",
    "# Map the reduced data into the initial feature space\n",
    "X_new = pca.inverse_transform(X_reduced)\n",
    "\n",
    "plt.figure(figsize=(10,10))\n",
    "plt.plot(X[:, 0], X[:, 1], 'or', alpha=0.3)\n",
    "plt.plot(X_new[:, 0], X_new[:, 1], 'or', alpha=0.8)\n",
    "plt.axis('equal')\n",
    "plt.title('Projection of sample onto the 1st principal component')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Nous avons réduit de moitié la dimensionnalité de l'espace des variables, tout en conservant les variables les plus importantes. C'est l'idée de base de la réduction de la dimensionnalité - pour approximer un ensemble de données multidimensionnelles en utilisant des données d'une dimension plus petite, tout en préservant autant d'informations sur l'ensemble de données que possible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "# Visualisation de données de grande dimension"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "#### Exemple d'iris"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "X_reduced = pca.fit_transform(X)\n",
    "\n",
    "print(\"Meaning of the 2 components:\")\n",
    "for component in pca.components_:\n",
    "    print(\" + \".join(\"%.3f x %s\" % (value, name)\n",
    "                     for value, name in zip(component,\n",
    "                                            iris.feature_names)))\n",
    "plt.figure(figsize=(10,7))\n",
    "plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=70, cmap='viridis')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Chiffres manuscrits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "\n",
    "digits = load_digits()\n",
    "X = digits.data\n",
    "y = digits.target\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "X_reduced = pca.fit_transform(X)\n",
    "\n",
    "print('Projecting %d-dimensional data to 2D' % X.shape[1])\n",
    "\n",
    "plt.figure(figsize=(12,10))\n",
    "plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, \n",
    "            edgecolor='none', alpha=0.7, s=40,\n",
    "            cmap=plt.cm.get_cmap('nipy_spectral', 10))\n",
    "plt.colorbar()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Jetons un coup d'œil aux 2 premières composantes principales"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)\n",
    "\n",
    "im = pca.components_[0]\n",
    "ax1.imshow(im.reshape((8, 8)), cmap='binary')\n",
    "ax1.set_xticks([])\n",
    "ax1.set_yticks([])\n",
    "ax1.set_title('First principal component')\n",
    "\n",
    "im = pca.components_[1]\n",
    "ax2.imshow(im.reshape((8, 8)), cmap='binary')\n",
    "ax2.set_xticks([])\n",
    "ax2.set_yticks([])\n",
    "ax2.set_title('Second principal component')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "### Compression des données"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(4,2))\n",
    "plt.imshow(X[15].reshape((8, 8)), cmap='binary')\n",
    "plt.xticks([])\n",
    "plt.yticks([])\n",
    "plt.title('Source image')\n",
    "plt.show()\n",
    "\n",
    "fig, axes = plt.subplots(8, 8, figsize=(10, 10))\n",
    "fig.subplots_adjust(hspace=0.1, wspace=0.1)\n",
    "\n",
    "for i, ax in enumerate(axes.flat):\n",
    "    pca = PCA(i + 1).fit(X)\n",
    "    im = pca.inverse_transform(pca.transform(X[15].reshape(1, -1)))\n",
    "\n",
    "    ax.imshow(im.reshape((8, 8)), cmap='binary')\n",
    "    ax.text(0.95, 0.05, 'n = {0}'.format(i + 1), ha='right',\n",
    "            transform=ax.transAxes, color='red')\n",
    "    ax.set_xticks([])\n",
    "    ax.set_yticks([])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Combien de composantes choisir? En règle générale, ils laissent au moins 90% de la variance des données à conserver."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = PCA().fit(X)\n",
    "\n",
    "plt.figure(figsize=(10,7))\n",
    "plt.plot(np.cumsum(pca.explained_variance_ratio_), color='k', lw=2)\n",
    "plt.xlabel('Number of components')\n",
    "plt.ylabel('Total explained variance')\n",
    "plt.xlim(0, 63)\n",
    "plt.yticks(np.arange(0, 1.1, 0.1))\n",
    "plt.axvline(21, c='b')\n",
    "plt.axhline(0.9, c='r')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = PCA(0.9).fit(X)\n",
    "print('We need %d components to explain 90%% of variance' \n",
    "      % pca.n_components_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "### PCA comme prétraitement"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "PCA sert souvent de technique de prétraitement. Jetons un coup d'œil à une tâche de reconnaissance faciale"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "lfw_people = datasets.fetch_lfw_people(min_faces_per_person=50, \n",
    "                resize=0.4, data_home='../../data/faces')\n",
    "\n",
    "print('%d objects, %d features, %d classes' % (lfw_people.data.shape[0],\n",
    "      lfw_people.data.shape[1], len(lfw_people.target_names)))\n",
    "print('\\nPersons:')\n",
    "for name in lfw_people.target_names:\n",
    "    print(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Distribution de la classe cible:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, name in enumerate(lfw_people.target_names):\n",
    "    print(\"{}: {} photos.\".format(name, (lfw_people.target == i).sum()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(8, 6))\n",
    "\n",
    "for i in range(15):\n",
    "    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n",
    "    ax.imshow(lfw_people.images[i], cmap='bone')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = \\\n",
    "    train_test_split(lfw_people.data, lfw_people.target, random_state=0)\n",
    "\n",
    "print('Train size:', X_train.shape[0], 'Test size:', X_test.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Ici, nous avons recours à «RandomizedPCA» (un algorithme un peu plus efficace) pour trouver les 100 premières composantes principales qui conservent> 90% de variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = PCA(n_components=100, svd_solver='randomized')\n",
    "pca.fit(X_train)\n",
    "\n",
    "print('100 principal components explain %.2f%% of variance' %\n",
    "      (100 * np.cumsum(pca.explained_variance_ratio_)[-1]))\n",
    "plt.figure(figsize=(10,7))\n",
    "plt.plot(np.cumsum(pca.explained_variance_ratio_), lw=2, color='k')\n",
    "plt.xlabel('Number of components')\n",
    "plt.ylabel('Total explained variance')\n",
    "plt.xlim(0, 100)\n",
    "plt.yticks(np.arange(0, 1.1, 0.1))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Premières composantes principales. Belles variables!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(16, 6))\n",
    "for i in range(30):\n",
    "    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])\n",
    "    ax.imshow(pca.components_[i].reshape((50, 37)), cmap='bone')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "en"
   },
   "source": [
    "\"Mean face\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.imshow(pca.mean_.reshape((50, 37)), cmap='bone')\n",
    "plt.xticks([])\n",
    "plt.yticks([])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Examinons maintenant la classification après avoir réduit la dimensionnalité de 1850 variables à seulement 100.\n",
    "Nous adapterons un classificateur softmax, alias une régression logistique multinomiale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "X_train_pca = pca.transform(X_train)\n",
    "X_test_pca = pca.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = LogisticRegression(multi_class='multinomial', \n",
    "                         random_state=17, solver='lbfgs', \n",
    "                         max_iter=10000)\n",
    "clf.fit(X_train_pca, y_train)\n",
    "y_pred = clf.predict(X_test_pca)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import (accuracy_score, classification_report,\n",
    "                             confusion_matrix)\n",
    "\n",
    "print(\"Accuracy: %f\" % accuracy_score(y_test, y_pred))\n",
    "print(classification_report(y_test, y_pred, target_names=lfw_people.target_names))\n",
    "\n",
    "M = confusion_matrix(y_test, y_pred)\n",
    "M_normalized = M.astype('float') / M.sum(axis=1)[:, np.newaxis]\n",
    "\n",
    "plt.figure(figsize=(10,10))\n",
    "im = plt.imshow(M_normalized, interpolation='nearest', cmap='Greens')\n",
    "plt.colorbar(im, shrink=0.71)\n",
    "tick_marks = np.arange(len(lfw_people.target_names))\n",
    "plt.xticks(tick_marks - 0.5, lfw_people.target_names, rotation=45)\n",
    "plt.yticks(tick_marks, lfw_people.target_names)\n",
    "plt.tight_layout()\n",
    "plt.ylabel('True person')\n",
    "plt.xlabel('Predicted person')\n",
    "plt.title('Normalized confusion matrix')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Comparons ce résultat avec le cas lorsque nous formons le modèle avec toutes les 1850 variables initiales."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "clf = LogisticRegression(multi_class='multinomial', \n",
    "                         random_state=17, solver='lbfgs', \n",
    "                         max_iter=10000)\n",
    "clf.fit(X_train, y_train)\n",
    "y_pred_full = clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Accuracy: %f\" % accuracy_score(y_test, y_pred_full))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Eh bien ... cela n'a pas fonctionné comme un bon exemple d'illustration, la régression logistique est toujours assez rapide avec 1850 variables, mais la précision est de 79%, ce qui est remarquablement supérieur à 72%. Mais pour les images de plus haute résolution et avec des modèles plus lourds, la différence de temps d'entraînement sera beaucoup plus remarquable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Quelques références\n",
    "- [PCA wiki page](https://en.wikipedia.org/wiki/Principal_component_analysis)\n",
    "- [PCA in 3 steps](http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html)\n",
    "- [SVD wiki page](https://en.wikipedia.org/wiki/Singular_value_decomposition)\n",
    "- [Eigenface](https://en.wikipedia.org/wiki/Eigenface)"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "name": "lesson8_part2_PCA.ipynb",
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "548px",
    "left": "76px",
    "top": "176px",
    "width": "307px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
